Aim
R language and its packages ecosystem are wonderful tools for data analysis. In community ecology field, many packages can be used for the ecological and evolutionary analysis, such as vegan[1], ape[2] and picante[3]. In microbial community ecology, with the development of the high-throughput sequencing techniques, the increasing data amount and complexity make the community data analysis a challenge. There have been some R packages created for the community data analysis in microbial ecology, such as phyloseq[4], microbiomeSeq, ampvis2, mare and microbiome. However, we lack a flexible and modularized R package to analyze and manage all the data. Here we created a microeco R package for this goal. This package provides powerful plotting and statistical approaches.
Main Features
- R6 Class to store and analyze data; fast, flexible and modularized
- Plotting the taxonomic abundance
- Venn diagram
- Alpha diversity
- Beta diversity
- Differential abundance analysis
- Environmental data analysis
- Network analysis
- Null model analysis
- Functional analysis
Class introduction
All the classes in microeco package depend on the R6 class. R6 uses the encapsulated object-oriented programming paradigm, which means that R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions. If you are interested in the class feature, read more from ‘Advanced R’ book:
A generic is a regular function, so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace. This influences how we think about naming. The methods belong to objects, not generics, and you call them like object$method().
R6’s reference semantics allow methods to simultaneously return a value and modify an object.
You can invoke an R6 method using $, which is an infix operator. If you set up your methods correctly, you can use chains of method calls as an alternative to the pipe. Every R6 object has an S3 class that reflects its hierarchy of R6 classes.
Details
microtable class
Many tools can be used for the bioinformatic analysis, such as QIIME[5], usearch(https://www.drive5.com/usearch/), mothur[6], and RDP(http://rdp.cme.msu.edu/). Although the format of results may be different in various tools, the main files can be classified as the following parts: 1, OTU table, the traditional species-sample table; 2, taxonomy table, the OTU taxonomy assignments information table; 3, phylogenetic tree. Generally, it is necessary to create a sample information table to store all the detailed sample information, including the environmental data and the missing values (NA). The funtions read_qiime_otu_table() and split_assignments() in package qiimer are appropriate for the parsing of raw QIIME OTU table file.
We use microtable class to store all the required data for the following analysis. We use the example data in the package to show the tutorials. The dataset is the 16S rRNA gene Miseq sequencing results of wetland soils in China published by An et al.[7], who surveyed soil bacterial communities in Chinese inland wetlands (IW), coastal wetland (CW) and Tibet plateau wetlands (TW) using 16S rRNA gene amplicon sequencing method. These wetlands included both saline and non-saline samples. The sample information table have 4 columns: “SampleID”, “Group”, “Type” and “Saline”. The column “SampleID” is same with the rownames. The column “Group” represents the IW, CW and TW. The column “Type” represents the sampling area: northeastern region (NE), northwest region (NW), North China area (NC), middle-lower reaches of the Yangtze River (YML), southern coastal area (SC), upper reaches of the Yangtze River (YU), Qinghai-Tibet Plateau (QTP). The column “Saline” represents the saline soils and non-saline soils. In this dataset, the env_data table is separated from the sample information table.
library(microeco)
# load the example data
data(otu_table)
data(taxonomy_table)
data(sample_info)
data(phylo_tree)
data(env_data)
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
theme_set(theme_bw())[1] “data.frame”
| 101 | 102 | 103 | 104 | 105 | |
|---|---|---|---|---|---|
| OTU_4272 | 4 | 0 | 2 | 1 | 0 |
| OTU_236 | 3 | 0 | 0 | 0 | 3 |
| OTU_399 | 39 | 51 | 59 | 20 | 22 |
| OTU_1556 | 10 | 8 | 10 | 4 | 8 |
| OTU_32 | 37 | 41 | 3 | 19 | 39 |
[1] “data.frame”
| Kingdom | Phylum | Class | |
|---|---|---|---|
| OTU_4272 | k__Bacteria | p__Firmicutes | c__Bacilli |
| OTU_236 | k__Bacteria | p__Chloroflexi | c__ |
| OTU_399 | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria |
| OTU_1556 | k__Bacteria | p__Acidobacteria | c__Acidobacteria |
| OTU_32 | k__Archaea | p__Miscellaneous Crenarchaeotic Group | c__ |
| Order | Family | Genus | Species | |
|---|---|---|---|---|
| OTU_4272 | o__Bacillales | f__ | g__ | s__ |
| OTU_236 | o__ | f__ | g__ | s__ |
| OTU_399 | o__Nitrosomonadales | f__Nitrosomonadaceae | g__ | s__ |
| OTU_1556 | o__Subgroup 17 | f__ | g__ | s__ |
| OTU_32 | o__ | f__ | g__ | s__ |
[1] “data.frame”
| SampleID | Group | Type | Saline |
|---|---|---|---|
| 1 | IW | NE | Non-saline soil |
| 2 | IW | NE | Non-saline soil |
| 3 | IW | NE | Non-saline soil |
| 4 | IW | NE | Non-saline soil |
| 5 | IW | NE | Non-saline soil |
[1] “data.frame”
| Latitude | Longitude | Altitude | Temperature | Precipitation |
|---|---|---|---|---|
| 52.96 | 122.6 | 432 | -4.2 | 445 |
| 52.95 | 122.6 | 445 | -4.3 | 449 |
| 52.95 | 122.6 | 430 | -4.3 | 449 |
| 52.95 | 122.6 | 430 | -4.3 | 449 |
| 52.95 | 122.6 | 429 | -4.3 | 449 |
[1] “phylo”
Then, we make an object using microtable class. This operation is very similar with that in package phyloseq[4], but microeco is simpler and faster.
# As an example, we only use 90 samples, 30 samples in each group (CW, IW, TW), see "Group" column in sample_info table.
# obtain the sample names
use_sample_names <- lapply(unique(sample_info$Group), function(x) rownames(sample_info[sample_info$Group == x, ])[1:30]) %>% unlist
# In R6 class, '$new' is the original method used to create a new object of class
dataset <- microtable$new(sample_table = sample_info[use_sample_names, ], otu_table = otu_table, tax_table = taxonomy_table, phylo_tree = phylo_tree)
class(dataset)[1] “microtable” “R6”
microtable class: sample_table have 90 rows and 4 columns otu_table have 14113 rows and 200 columns tax_table have 14113 rows and 7 columns phylo_tree have 14096 tips
To make the species and sample information consistent across different files within the dataset object, we can use function tidy_dataset() to trim the dataset.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 13628 tips
Then, we remove OTUs which are not assigned in the Kingdom "k__Archaea" or "k__Bacteria".
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13330 rows and 7 columns phylo_tree have 13628 tips
We also remove OTUs with the taxonomic assignments “mitochondria” or “chloroplast”.
# This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.
# So if you want to filter some taxa not considerd pollutions, please use subset.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13628 tips
Then, to make the OTUs same in otu_table, tax_table and phylo_tree, we use tidy_dataset() again.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13296 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13296 tips
Then we use sample_sums() to check the sequence numbers in each sample.
[1] 10316 37087
Sometimes, in order to reduce the effects of species number on the diversity measurements, we need to perform the resampling to make the sequence number equal for each sample.
# As the example, we use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range[1] 10000 10000
Then, we calculate the taxa abundance at each taxonomic rank using cal_abund(). This function return a list called taxa_abund containing several data frame of the abundance information at each taxonomic rank. The list is stored in the object microtable automatically.
[1] “list”
If you want to save the taxa abundance file to the computer, use save_abund().
Then, we calculate the alpha diversity. The result is also stored in the object microtable automatically. As an example, we do not calculate phylogenetic diversity.
# If you want to add Faith's phylogenetic diversity, use PD = TRUE
dataset$cal_alphadiv(PD = FALSE)
# save dataset$alpha_diversity to the computer
dataset$save_alphadiv(dirpath = "alpha_diversity")We also calculate the beta diversity (distance matrix) using function cal_betadiv(). We provide four most frequently used indexes: Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac.
trans_abund class
This class is used to transform abundance data for plotting the taxa abundance. We first use this class for the bar plot. We use the highest 10 Phyla in dataset.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# t1 now include the transformed abundance data t1$abund_data and other elements for the following plottingWe do not retain the sample names in x axis and add the facet to show group.
The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)Then alluvial plot is implemented in the plot_bar function.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
t1$plot_bar(use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6)The box plot is a good way to intuitionally show data distribution across groups.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15)
t1$plot_box(group = "Group")We also show the heatmap using the high abundant genera.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40)
t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)Then, we show the pie chart.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Group")
t1$plot_pie(facet_nrow = 1)trans_venn class
The trans_venn class is used for venn plotting. We first merge samples according to the “Group” column of sample_table.
# merge samples to one community for each group
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1, ratio = "seqratio")
t1$plot_venn()
# The integer data is OTU number
# The percentage data is the sequence number/total sequence numberWhen the groups are too many to show using venn plot, we can use petal plot.
dataset1 <- dataset$merge_samples(use_group = "Type")
t1 <- trans_venn$new(dataset1)
t1$plot_venn(petal_plot = TRUE)Sometimes, we are interested in the particular and shared species. In another words, the composition of the particular or shared species may account for the different and similar parts of ecological characteristics across groups[8]. For this goal, we first transform the results of venn plot to the traditional species-sample table, that is, another object of microtable class.
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1)
# transform venn results to the sample-species table, do not consider abundance, only use frequency.
t2 <- t1$trans_venn_com(use_OTUs_frequency = TRUE)
class(t2)[1] “microtable” “R6”
We use bar plot to show the composition at the Genus level.
# calculate taxa abundance, that is, the frequency
t2$cal_abund()
# transform and plot; $ is like the pipe %>%
trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 10)$
plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE)We also try to use pie chart to show the compositions at the Phylum level.
trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)$
plot_pie(facet_nrow = 3, use_colors = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))trans_alpha class
Alpha diversity can be transformed and plotted using trans_alpha class. Creating this class can return two data frame-alpha_data and alpha_stat.
| Group | Measure | N | Mean | SD | SE |
|---|---|---|---|---|---|
| CW | Observed | 30 | 1843 | 220.6 | 40.27 |
| CW | Chao1 | 30 | 2553 | 338.1 | 61.73 |
| CW | ACE | 30 | 2716 | 367 | 67.01 |
| CW | Shannon | 30 | 6.308 | 0.5355 | 0.09777 |
| CW | Simpson | 30 | 0.9897 | 0.01305 | 0.002382 |
Then, we test the differences among groups using the KW rank sum test and anova with multiple comparisons.
| Groups | Measure | Test method | p.value | Significance |
|---|---|---|---|---|
| IW vs CW | Observed | KW | 0.0371 | * |
| IW vs TW | Observed | KW | 0.4553 | |
| CW vs TW | Observed | KW | 0.3912 | |
| IW vs CW vs TW | Observed | KW | 0.155 | |
| IW vs CW | Chao1 | KW | 0.002689 | ** |
| Observed | Chao1 | ACE | Shannon | Simpson | InvSimpson | Fisher | Coverage | |
|---|---|---|---|---|---|---|---|---|
| IW | a | a | a | a | a | a | a | b |
| TW | a | ab | b | a | a | a | a | a |
| CW | a | b | b | a | a | a | a | a |
Now, let us plot the mean and se of alpha diversity for each group, and add the agricolae::duncan.test result.
We can also use the boxplot to show the paired comparisons directly.
trans_beta class
Beta diversity can be transformed and plotted using trans_beta class. The analysis referred to the beta diversity in this class mainly include ordination, group distance and manova. We first show the ordination using PCoA.
# we first create an object and select PCoA for ordination
t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray", ordination = "PCoA")Then we plot and compare the group distances.
# plot sample distances within groups
t1$cal_group_distance()
t1$plot_group_distance(distance_pair_stat = TRUE)# plot sample distances between groups
t1$cal_group_distance(within_group = FALSE)
t1$plot_group_distance(distance_pair_stat = TRUE)Clustering plot is also a frequently used method.
# use replace_name to set the label name, group parameter used to set the color
t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))perMANOVA is often used in the differential test of distances among groups.
| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 10.57 | 0.1955 | 0.001 |
| Residuals | 87 | 25.18 | 0.2895 | NA | 0.8045 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
| Groups | measure | permutations | R2 | p.value | Significance |
|---|---|---|---|---|---|
| IW vs CW | bray | 999 | 0.1595 | 0.001 | *** |
| IW vs TW | bray | 999 | 0.147 | 0.001 | *** |
| CW vs TW | bray | 999 | 0.1556 | 0.001 | *** |
# manova for specified group set: here "Group + Type"
t1$cal_manova(cal_manova_set = "Group + Type")
# t1$res_manova$aov.tab| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 12.01 | 0.1955 | 0.001 |
| Type | 3 | 3.783 | 1.261 | 4.949 | 0.1208 | 0.001 |
| Residuals | 84 | 21.4 | 0.2548 | NA | 0.6836 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
trans_diff class
Differential abundance test is a very important part in the microbial community data analysis. It can be used to find the significant taxa in determining the community differences across groups. Currently, trans_diff class have three famous approaches to perform this analysis: metastat, rf and LEfSe[9]. Metastat[10] depends on the permutations and t-test and performs well on the sparse data. It is used for the comparisons between two groups.
# metastat Genus level
t1 <- trans_diff$new(dataset = dataset, method = "metastat", group = "Group", metastat_taxa_level = "Genus")
# plot the first paired groups, choose_group = 1
t1$plot_metastat(use_number = 1:10, qvalue = 0.05, choose_group = 1)LEfSe combines the non-parametic test and LDA and depends on the python. Now we rewrite this approach in this package.
t1 <- trans_diff$new(dataset = dataset, method = "lefse", group = "Group", alpha = 0.01, lefse_subgroup = NULL)
t1$plot_lefse_bar(LDA_score = 4)| Taxa | Group | pvalue | LDA |
|---|---|---|---|
| k__Bacteria|p__Proteobacteria | CW | 3.21e-11 | 4.852 |
| k__Bacteria|p__Acidobacteria|c__Acidobacteria | IW | 8.559e-13 | 4.786 |
| k__Bacteria|p__Acidobacteria | IW | 5.749e-12 | 4.785 |
| k__Bacteria|p__Bacteroidetes | TW | 1.19e-09 | 4.767 |
| k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria | CW | 5.475e-12 | 4.628 |
We can also plot the abundance of taxa detected using LEfSe.
Then, we show the cladegram of the differential features in the taxonomic tree. There are too many taxa in this dataset. As an example, we only use the highest 200 abundant taxa in the tree and 50 differential features. We only show the full taxonomic label at Phylum level and use letters at other levels to reduce the text overlap.
# clade_label_level 5 represent phylum level in this analysis
# require ggtree package
t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, clade_label_level = 5)There may be a problem related with the taxonomic labels in the plot. When the levels used are too many, the taxonomic labels may have too much overlap. However, if only presenting the Phylum level, the taxa in the legend with marked letters are too many. At this time, you can select the taxa that you want to show in th plot manually like the following operation.
# choose some taxa according to the positions in the previous picture; those taxa labels have minimum overlap
use_labels <- c("c__Deltaproteobacteria", "c__Actinobacteria", "o__Rhizobiales", "p__Proteobacteria", "p__Bacteroidetes",
"o__Micrococcales", "p__Acidobacteria", "p__Verrucomicrobia", "p__Firmicutes",
"p__Chloroflexi", "c__Acidobacteria", "c__Gammaproteobacteria", "c__Betaproteobacteria", "c__KD4-96",
"c__Bacilli", "o__Gemmatimonadales", "f__Gemmatimonadaceae", "o__Bacillales", "o__Rhodobacterales")
# then use parameter select_show_labels to show
t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, select_show_labels = use_labels)
# Now we can see that more taxa names appeare in the treeIf you are interested in taxonomic tree, you can also use metacoder package[Foster_Metacoder_2017] to plot the taxonomic tree based on the selected taxa. We do not show the usage here.
The third approach is called rf, which depends on the random forest[11, 12] and the non-parametic test. The current method calculates random forest by bootstrapping like the method in LEfSe and only use the significant features.
# use Genus level for parameter rf_taxa_level, if you want to use all taxa, change to "all"
t1 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
t2 <- t1$plot_diff_abund(use_number = 1:20, only_abund_plot = FALSE)
gridExtra::grid.arrange(t2$p1, t2$p2, ncol=2, nrow = 1, widths = c(2,2))
# the middle asterisk represent the significancestrans_env class
The environmental variables are very useful in analyzing microbial community structure and assembly mechanisms. We first show the RDA analysis using both the db-RDA and RDA.
# add_data is used to add the environmental data
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])# use bray-curtis distance to do dbrda
t1$cal_rda(use_dbrda = TRUE, use_measure = "bray")
t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 10)
t1$plot_rda(plot_color = "Group")# use Genus
t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")
# As the main results of RDA are related with the projection and angles between different arrows,
# we adjust the length of the arrow to show them clearly.
t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)
# show important genus
t1$plot_rda(plot_color = "Group")Mantel test can be used to check whether there is significant correlations between environmental variables and distance metrics.
| variable_name | cor_method | corr_res | p_res | significance |
|---|---|---|---|---|
| Temperature | pearson | 0.452 | 0.001 | *** |
| Precipitation | pearson | 0.2791 | 0.001 | *** |
| TOC | pearson | 0.13 | 0.003 | ** |
| NH4 | pearson | -0.05539 | 0.922 | |
| NO3 | pearson | 0.06758 | 0.049 | * |
| pH | pearson | 0.4085 | 0.001 | *** |
| Conductivity | pearson | 0.2643 | 0.001 | *** |
| TN | pearson | 0.1321 | 0.002 | ** |
The correlations between environmental variables and taxa are important in analyzing or inferring the factors affecting community structure. In this example, we first perform the differential abundance test and random forest analysis to obtain the important genera. Then we use those taxa to perform correlation analysis.
t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])Then, we can plot the correlation results using ggplot2 or pheatmap.
If you are concerned with the relationship between environmental factors and alpha diversity, you can also use this function.
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
# use add_abund_table parameter to add the extra data table
t1$cal_cor(add_abund_table = dataset$alpha_diversity)
t1$plot_corr()trans_network class
Network is a frequently used approach to study the co-occurrence patterns in microbial ecology[13–15]. In this part, we describe all the core contents in the trans_network class. The network construction approaches can be classified into two types: correlation-based and non correlation-based. Several approaches can be used to calculate correlations and significances. So we create a class called trans_corr to calculate correlations. trans_network class inherits the trans_corr class.
We first introduce the correlation-based network. The parameter cal_cor in trans_network controls the correlation calculation method.
# Use R base cor.test, slow
t1 <- trans_network$new(dataset = dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")# SparCC method, you can use nThreads parameter to increase CPU threads to accelerate the calculation, see trans_corr class documentation.
t1 <- trans_network$new(dataset = dataset, cal_cor = "SparCC", taxa_level = "OTU", filter_thres = 0.0001, SparCC_simu_num = 100)# When the OTU number is large, use R WGCNA package to replace R base to calculate correlations
t1 <- trans_network$new(dataset = dataset, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")# contruct network; require igraph package; return t1$res_network
# use RMT to find optimal correlation coefficient
t1$cal_network(p_thres = 0.01, COR_optimization = TRUE)# use arbitrary coefficient threshold to contruct network
t1$cal_network(p_thres = 0.01, COR_cut = 0.7)We plot the network and present the node colors according to the calculated modules in Gephi.
Now, we show the node colors with the Phylum information and the edges colors with the positive and negative correlations. All the data we used have been stored in the network.gexf file, including modules classifications, Phylum information and edges classifications.
# calculate network attributes; return t1$res_network_attr
t1$cal_network_attr()
# t1$res_network_attr| Property | Value |
|---|---|
| Vertex | 407 |
| Edge | 1989 |
| Average_degree | 9.774 |
| Average_path_length | 3.878 |
| Network_diameter | 9 |
| Clustering_coefficient | 0.4698 |
| Density | 0.02407 |
| Heterogeneity | 1.194 |
| Centralization | 0.09908 |
# classify the node; return t1$res_node_type
t1$cal_node_type()
# we retain the file for the following use in trans_func part
network_node_type <- t1$res_node_type
# t1$res_node_type[1:10, c(1:4, 12:13)]| z | module | p | taxa_roles | degree | |
|---|---|---|---|---|---|
| OTU_50 | -1.305 | M2 | 0 | Peripheral nodes | 2 |
| OTU_1 | -0.04067 | M2 | 0 | Peripheral nodes | 21 |
| OTU_55 | -1.239 | M2 | 0 | Peripheral nodes | 3 |
| OTU_13824 | -0.2403 | M2 | 0 | Peripheral nodes | 18 |
| OTU_151 | -1.372 | M2 | 0.4444 | Peripheral nodes | 1 |
| OTU_4731 | 1.29 | M2 | 0 | Peripheral nodes | 41 |
| OTU_390 | 0.4918 | M2 | 0 | Peripheral nodes | 29 |
| OTU_67 | -0.9724 | M2 | 0.4764 | Peripheral nodes | 7 |
| OTU_13463 | -1.372 | M2 | 0 | Peripheral nodes | 1 |
| OTU_357 | 0.2255 | M2 | 0.32 | Peripheral nodes | 25 |
| betweenness | |
|---|---|
| OTU_50 | 0 |
| OTU_1 | 2 |
| OTU_55 | 0 |
| OTU_13824 | 8 |
| OTU_151 | 0 |
| OTU_4731 | 2782 |
| OTU_390 | 178 |
| OTU_67 | 12 |
| OTU_13463 | 0 |
| OTU_357 | 3280 |
# calculate the links between or within taxonomic ranks
t1$cal_sum_links(taxa_level = "Phylum")
# plot the summed links; require chorddiag package
t1$plot_sum_links(plot_pos = TRUE, plot_num = 10)The subset_network() function can be used to take a part of nodes and edges among these nodes from the network. In this function, you should provide the nodes you need using the node parameter.
# this return a sub network that contains all nodes of module M1
t1$subset_network(node = t1$res_node_type %>% .[.$module == "M1", ] %>% rownames, rm_single = TRUE)We also introduce another network construction approach: Probabilistic Graphical Models (PGM), which is implemented in julia package FlashWeave[16]. If you want to use this method like the following code, you should first install julia language in your computer and the FlashWeave package, and add the julia in the computer path.
trans_nullmodel class
To make the use of null model convenient, we incorporate several null models and phylogenetic analysis[17]. We first check the phylogenetic signal.
# generate object; use 1000 OTUs as example; use cpp to accelerate betaMPD calculation
t1 <- trans_nullmodel$new(dataset, taxa_number = 1000, add_data = env_data, cpp = TRUE)betaNRI(ses.betampd) is used to show the ‘basal’ phylogenetic turnover[18].
If we want to plot the betaNRI, we can use plot_group_distance function in trans_beta class. The results showed that the mean betaNRI of TW is extremely and significantly larger that those in CW and IW, revealing that the basal phylogenetic turnover in TW is high, if no strong dispersal limitation exists among the wetlands of different areas in China.
t2 <- trans_beta$new(dataset = dataset, group = "Group")
# must assign use_matrix value. This matrix is the default matrix to store the beta-diversity matrix and calculate group distance.
dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd
# create trans_beta class, use measure "betaNRI"
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
# transform the distance for each group
t2$cal_group_distance()
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)Sometimes, if you want to perform null model analysis for each group individually, such as one group as one species pool, you can calculate the results for each group, respectively. We can find that, when we perform betaNRI for each group respectively, the mean betaNRI between CW and TW are not significantly different, and they are both significantly higher than that in IW, revealing that the strength of variable selection in CW and TW may be similar under the condition that each area owns a specific species pool.
# we create a list to store the trans_nullmodel results.
sesbeta_each <- list()
group_col <- "Group"
all_groups <- unique(dataset$sample_table[, group_col])
# calculate for each group, respectively
for(i in all_groups){
# like the above operation, but need provide 'group' and 'select_group'
test <- trans_nullmodel$new(dataset, group = group_col, select_group = i, taxa_number = 1000, add_data = env_data, cpp = TRUE)
test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
sesbeta_each[[i]] <- test$res_ses_betampd
}
# merge and reshape to generate one symmetrical matrix
test <- lapply(sesbeta_each, melt) %>% do.call(rbind, .) %>%
reshape2::dcast(., Var1~Var2, value.var = "value") %>% `row.names<-`(.[,1]) %>% .[, -1, drop = FALSE]
# like the above operation
dataset$beta_diversity[["betaNRI"]] <- test
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
t2$cal_group_distance()
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)betaNTI(ses.betamntd) can be used to indicate the phylogenetic terminal turnover.
# null model run 500 times
t1$cal_ses_betamntd(runs=500, abundance.weighted = TRUE)
# t1$res_ses_betamntd[1:5, 1:5]| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0 | -5.637 | -5.701 | -5.758 | -5.531 |
| -5.637 | 0 | -6.101 | -6.275 | -6.013 |
| -5.701 | -6.101 | 0 | -6.333 | -6.197 |
| -5.758 | -6.275 | -6.333 | 0 | -6.141 |
| -5.531 | -6.013 | -6.197 | -6.141 | 0 |
rcbray can be calculated using function cal_rcbray().
As an example, we also calculate the propotion of the inferred processes on the community assembly.
| process | percentage |
|---|---|
| variable selection | 4.419 |
| homogeneous selection | 48.71 |
| dispersal limitation | 0 |
| homogeneous dispersal | 8.739 |
| drift | 38.13 |
trans_func class
Ecological researchers are usually interested in the the funtional profiles of microbial communities. Because functional or metabolic data are more powerful to explain the structure and dynamics of microbial communities and to infer the underlying mechanisms. As metagenomic sequencing is complicated and expensive, using amplicon sequencing data to predict functional profiles is a good choice. Several software are often used for this goal, such as PICRUSt[19], Tax4Fun[20] and FAPROTAX[21, 22]. These tools are great to be used for the prediction of functional profiles based on the prokaryotic communities from sequencing results. Sometimes, it is also important to obtain the functions for each taxa or OTU, not just the whole profile of communities. But it is hard to know exact functions of each OTU. FAPROTAX database is a collection of the traits and characteristics of prokaryotes based on the known research results published in books and literatures. We use it to analyze the biogeochemical roles of prokaryotes in this part.
# create object of trans_func
t2 <- trans_func$new(dataset)
# mapping the taxonomy to the database
t2$cal_prok_biogeo()
# return res_prok_biogeo, 1 represent function exists, 0 represent no or cannot confirmed.
# t2$res_prok_biogeo[1:5, 1:2]| methanotrophy | acetoclastic_methanogenesis | |
|---|---|---|
| OTU_4272 | 0 | 0 |
| OTU_236 | 0 | 0 |
| OTU_399 | 0 | 0 |
| OTU_1556 | 0 | 0 |
| OTU_32 | 0 | 0 |
The percentages of the OTUs having a same trait can reflect the functional redundancy and potential of this function in the community or the specific pattern of the modules in the network.
# calculate the percentages of OTUs for each trait in each modules of network
# use_community = FALSE represent calculating module, not community, node_type_table provide the module information
t2$cal_prok_biogeo_perc(use_community = FALSE, node_type_table = network_node_type)
# we only plot some important traits, so we use the default group list to filter and show the traits.
t2$plot_prok_biogeo_perc(group_list_default = TRUE, select_samples = paste0("M", 1:10))
# M represents module, ordered by the nodes number from high to low# calculate the percentages for communities
t2$cal_prok_biogeo_perc(use_community = TRUE)
# t2$res_prok_biogeo_perc[1:5, 1:2]| methanotrophy | acetoclastic_methanogenesis |
|---|---|
| 0.39 | 0.04 |
| 0.27 | 0 |
| 0.48 | 0 |
| 0.48 | 0 |
| 0.56 | 0 |
# then we try to correlate the res_prok_biogeo_perc of communities to environmental variables
t3 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
t3$cal_cor(add_abund_table = t2$res_prok_biogeo_perc, cor_method = "spearman")
t3$plot_corr(pheatmap = TRUE)We also calculate the functional potentials of communities on the biogeochemical cycles by using FAPROTAX software.
t2 <- trans_func$new(dataset)
# This function will invoke the python 2.7
t2$cal_biogeo()
# t2$res_biogeo[1:10, 1:3]| 1 | 2 | 3 | |
|---|---|---|---|
| methanotrophy | 26 | 15 | 16 |
| acetoclastic_methanogenesis | 2 | 0 | 0 |
| methanogenesis_by_disproportionation_of_methyl_groups | 5 | 0 | 0 |
| methanogenesis_using_formate | 0 | 0 | 0 |
| methanogenesis_by_CO2_reduction_with_H2 | 23 | 2 | 5 |
| methanogenesis_by_reduction_of_methyl_compounds_with_H2 | 0 | 0 | 0 |
| hydrogenotrophic_methanogenesis | 23 | 2 | 5 |
| methanogenesis | 25 | 2 | 5 |
| methanol_oxidation | 12 | 2 | 6 |
| methylotrophy | 38 | 17 | 22 |
Tax4Fun requires a strict input file demand on the taxonomic information. To analyze the trimmed or changed OTU data in R with Tax4Fun, we provide a link to the Tax4Fun functional prediction.
t1 <- trans_func$new(dataset)
# require install Tax4Fun and download SILVA123 ref data from http://tax4fun.gobics.de/
# decompress SILVA123; provide path in folderReferenceData as you put
t1$cal_tax4fun_func(folderReferenceData = "./SILVA123")
# return two file: tax4fun_KO: KO file; tax4fun_path: pathway file.
# t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]| K00001; alcohol dehydrogenase [EC:1.1.1.1] | K00002; alcohol dehydrogenase (NADP+) [EC:1.1.1.2] |
|---|---|
| 0.0004823 | 5.942e-06 |
| 0.0005266 | 4.017e-06 |
| 0.0005054 | 6.168e-06 |
| 0.0005109 | 5.888e-06 |
| 0.0005083 | 5.547e-06 |
Now, we use pathway file to analyze the abundance of pathway.
# must transpose to taxa row, sample column
pathway_file <- t1$tax4fun_path$Tax4FunProfile %>% t %>% as.data.frame
# filter rownames, only keep ko+number
rownames(pathway_file) %<>% gsub("(^.*);\\s.*", "\\1", .)
# load the pathway hierarchical metadata
data(ko_map)
# create a microtable object, familiar?
func1 <- microtable$new(otu_table = pathway_file, tax_table = ko_map, sample_table = t1$sample_table)
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 341 rows and 3 columns
Now, we need to trim data and calculate abundance.
func1$tidy_dataset()
# calculate abundance automatically at three levels: level_1, level_2, level_3
func1$cal_abund()
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 284 rows and 3 columns Taxa abundance: calculated for level_1,level_2,level_3
Then, we can plot the abundance.
# bar plot at level_1
func2 <- trans_abund$new(func1, taxrank = "level_1", groupmean = "Group")
func2$plot_bar(legend_text_italic = FALSE)We can also do something else. For example, we can use lefse to test the differences of the abundances and find the important enriched pathways across groups.
func2 <- trans_diff$new(dataset = func1, method = "lefse", group = "Group", alpha = 0.05, lefse_subgroup = NULL)
func2$plot_lefse_bar(LDA_score = 3, width = 0.8)Something important
clone
R6 class has a special copy mechanism which is different from S3 and S4. If you want to copy an object completely, you should use function clone
# use clone to copy completely
t1 <- clone(dataset)
t2 <- clone(t1)
t2$sample_table <- NULL
identical(t2, t1)[1] FALSE
# this operation is usually unuseful, because changing t2 will also affect t1
t2 <- t1
t2$sample_table <- NULL
identical(t2, t1)[1] TRUE
change object
All the classes are set public, meaning that you can change, add or remove the objects stored in them as you want.
Contact
If you have any questions, please feel free to join the QQ group: 277434916 for the discussion.
References
1. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, et al. Vegan: Community ecology package. 2019.
2. Paradis E, Schliep K. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2018; 35: 526–528.
3. Kembel S, Cowan P, Helmus M, Cornwell W, Morlon H, Ackerly D, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 2010; 26: 1463–1464.
4. Mcmurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. Plos One 2013; 8: e61217.
5. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 2010; 7: 335–336.
6. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 2009; 75: 7537–41.
7. An J, Liu C, Wang Q, Yao M, Rui J, Zhang S, et al. Soil bacterial community structure in chinese wetlands. Geoderma 2019; 337: 290–299.
8. Mendes R, Kruijt M, De BI, Dekkers E, Van der VM, Schneider JH, et al. Deciphering the rhizosphere microbiome for disease-suppressive bacteria. Science 2011; 332: 1097–100.
9. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biology 2011; 12: R60.
10. White J, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 2009; 5: e1000352.
11. Beck D, Foster JA. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PloS one 2014; 9: e87830.
12. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, et al. Human gut microbiome viewed across age and geography. Nature 2012; 486: 222–227.
13. Deng Y, Jiang Y-H, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC bioinformatics 2012; 13: 113.
14. Faust K, Raes J. Microbial interactions: From networks to models. Nature Reviews Microbiology 2012; 10: 538–550.
15. Coyte KZ, Schluter J, Foster KR. The ecology of the microbiome: Networks, competition, and stability. Science 2015; 350: 663–666.
16. Tackmann J, Matias Rodrigues JF, Mering C von. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell Systems 2019; 9: 286–296 e8.
17. Stegen JC, Lin X, Fredrickson JK, Chen X, Kennedy DW, Murray CJ, et al. Quantifying community assembly processes and identifying features that impose them. The ISME Journal 2013; 7: 2069–2079.
18. Liu C, Yao M, Stegen JC, Rui J, Li J, Li X. Long-term nitrogen addition affects the phylogenetic turnover of soil microbial community responding to moisture pulse. Scientific Reports 2017; 7: 17492.
19. Langille MG, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology 2013; 31: 814–21.
20. Aßhauer KP, Wemheuer B, Daniel R, Meinicke P. Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics 2015; 31: 2882–2884.
21. Louca S, Jacques SM, Pires AP, Leal JS, Srivastava DS, Parfrey LW, et al. High taxonomic variability despite stable functional structure across microbial communities. Nature Ecology & Evolution 2016; 1: 0015.
22. Louca S, Parfrey LW, Doebeli M. Decoupling function and taxonomy in the global ocean microbiome. Science 2016; 353: 1272.